
Fix flaky test execution caused by Thread #34966

Merged: 1 commit into main from fix_flaky_2, Nov 27, 2024
Conversation

ydshieh
Collaborator

@ydshieh ydshieh commented Nov 27, 2024

What does this PR do?

As discussed offline:

Would it be possible to change the Thread below to a Process 🙏?
I know it's not good to change code that is working and makes sense for the sole purpose of testing, but the Thread here (started without a join) makes some test(s) fail flakily, e.g.
test_cached_model_has_minimum_calls_to_head
where we count the number of calls made to the Hub. Other tests that (eventually) call from_pretrained and hit the Thread line will affect test_cached_model_has_minimum_calls_to_head because they run at the same time in the same process.
There are other solutions, but they involve patching (a lot of) tests, or patching at the conftest level, which is not good either IMO.
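The interaction can be sketched with plain threading. Everything below is an illustrative stand-in, not the actual transformers internals:

```python
import threading
import time

# Hypothetical stand-in for the Hub request log that
# test_cached_model_has_minimum_calls_to_head inspects.
hub_calls = []

def background_head_request():
    # Simulates the metadata call that from_pretrained fires in a Thread.
    time.sleep(0.05)
    hub_calls.append("HEAD /model")

def from_pretrained_stub():
    # Fire-and-forget: the caller never joins, so the request can land
    # at any later point while other tests in the same process run.
    t = threading.Thread(target=background_head_request)
    t.start()
    return t

# "Test A" loads a model and moves on without joining...
pending = from_pretrained_stub()

# ...so when "Test B" snapshots the call count, the result depends on timing.
snapshot = len(hub_calls)

pending.join()
# Once the stray thread finishes, the count "Test B" relied on has moved.
final = len(hub_calls)
```

Running several tests concurrently in one process multiplies these stray threads, which is why the call-count assertion fails only intermittently.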


To check that the PR works (make sure tensorflow is available), run:

python script.py

where script.py is:

```python
import os

for i in range(20):
    os.system('python -m pytest -v tests/models/auto/test_modeling_auto.py tests/models/auto/test_modeling_tf_auto.py -k "test_from_pretrained_identifier or test_cached_model_has_minimum_calls_to_head"')
    # Alternatively, run several times on the CI workflow (it's flaky and
    # not easy to reproduce), or add `time.sleep(0.5)` at the beginning
    # of `auto_convenanorsion`.
```

@ydshieh ydshieh requested a review from LysandreJik November 27, 2024 10:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@LysandreJik LysandreJik left a comment


Thank you!

@ydshieh ydshieh merged commit 5f8b24e into main Nov 27, 2024
25 checks passed
@ydshieh ydshieh deleted the fix_flaky_2 branch November 27, 2024 15:32
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
@loadams
Contributor

loadams commented Dec 5, 2024

Hi @ydshieh - FYI, this change appears to have broken some unit tests in DeepSpeed, specifically where we download a model. I'll look at ways to resolve this; I assume it is an issue with multiprocessing/threading.

@ydshieh
Collaborator Author

ydshieh commented Dec 5, 2024

@loadams Thank you for informing us. I hope there is a way that works in both cases 🙏. Let me know if there is anything we can help with here.

BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
@tomaarsen
Member

tomaarsen commented Dec 12, 2024

@ydshieh I'm also experiencing issues here for Sentence Transformers/Cross Encoder models, see UKPLab/sentence-transformers#3129

In short: multiprocessing.Process does not work when the process is started outside of an `if __name__ == "__main__":` guard. I recognize that most programs should be using that line, but I'd rather not force it on my users.

If one of my users loads any model that only has a pytorch_model.bin, then it'll fail, e.g.:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("embaas/sentence-transformers-gte-base")
```

or

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
```

which internally call

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("embaas/sentence-transformers-gte-base")
```

or

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
```

All of these get:

```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
```
  • Tom Aarsen

@ydshieh
Collaborator Author

ydshieh commented Dec 12, 2024

> Hi @ydshieh - FYI this change appears to have broken some unit tests in DeepSpeed, specifically where we download a model. I'll look at ways to resolve this, I assume it is an issue in multiprocessing/threading.

Hi @loadams FYI: it's reverted back to use Thread in #35236 (as there are other issues reported)

@loadams
Contributor

loadams commented Dec 12, 2024

> I recognize that most programs should be using that line, but I'd rather not force it on my users.

@ydshieh - thanks for the update!

loadams added a commit to microsoft/DeepSpeed that referenced this pull request Dec 16, 2024
…ues in tests (#6822)

Changes from huggingface/transformers#34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:

```
  File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
    raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```

Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506

This was resolved on the Transformers side here:
huggingface/transformers#35236